this dataset is about hotels booking in this project we need to analyze the data to extract insights from this raw data we will data cleansing to get trusted data to get high quality insight from the data and doing exploratory data analysis to visualize our insight
1- Where do guests came from?
2- How does price of night vary in hotels across the year?
3- Analysing prefrence of guests,what they basically prefere?
4- The guests increase in which month?
5- Analyzing booking by market segement.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os
import plotly.express as px
file = r'D:/FCAIH/DATASETS/hotel_bookings.xlsx'
df=pd.read_excel(file)
df.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
df.shape
(119390, 32)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 119390 entries, 0 to 119389 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 119390 non-null object 1 is_canceled 119390 non-null int64 2 lead_time 119390 non-null int64 3 arrival_date_year 119390 non-null int64 4 arrival_date_month 119390 non-null object 5 arrival_date_week_number 119390 non-null int64 6 arrival_date_day_of_month 119390 non-null int64 7 stays_in_weekend_nights 119390 non-null int64 8 stays_in_week_nights 119390 non-null int64 9 adults 119390 non-null int64 10 children 119386 non-null float64 11 babies 119390 non-null int64 12 meal 119390 non-null object 13 country 118902 non-null object 14 market_segment 119390 non-null object 15 distribution_channel 119390 non-null object 16 is_repeated_guest 119390 non-null int64 17 previous_cancellations 119390 non-null int64 18 previous_bookings_not_canceled 119390 non-null int64 19 reserved_room_type 119390 non-null object 20 assigned_room_type 119390 non-null object 21 booking_changes 119390 non-null int64 22 deposit_type 119390 non-null object 23 agent 103050 non-null float64 24 company 6797 non-null float64 25 days_in_waiting_list 119390 non-null int64 26 customer_type 119390 non-null object 27 adr 119390 non-null float64 28 required_car_parking_spaces 119390 non-null int64 29 total_of_special_requests 119390 non-null int64 30 reservation_status 119390 non-null object 31 reservation_status_date 119390 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(4), int64(16), object(11) memory usage: 29.1+ MB
df.drop(['company','children','babies','distribution_channel','is_repeated_guest','required_car_parking_spaces','previous_cancellations','previous_bookings_not_canceled'],axis=1,inplace=True)
# Removing duplicates from data
df.drop_duplicates(inplace=True)
# cheching for nulls
df.agent.isnull().sum()
11930
#replacing nulls with zero
df.agent.fillna(0,inplace=True)
#copy the data
dfc=df.copy()
dfc
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | assigned_room_type | booking_changes | deposit_type | agent | days_in_waiting_list | customer_type | adr | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | C | 3 | No Deposit | 0.0 | 0 | Transient | 0.00 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | C | 4 | No Deposit | 0.0 | 0 | Transient | 0.00 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | C | 0 | No Deposit | 0.0 | 0 | Transient | 75.00 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | A | 0 | No Deposit | 304.0 | 0 | Transient | 75.00 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | A | 0 | No Deposit | 240.0 | 0 | Transient | 98.00 | 1 | Check-Out | 2015-07-03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | A | 0 | No Deposit | 394.0 | 0 | Transient | 96.14 | 0 | Check-Out | 2017-09-06 |
| 119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | E | 0 | No Deposit | 9.0 | 0 | Transient | 225.43 | 2 | Check-Out | 2017-09-07 |
| 119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | D | 0 | No Deposit | 9.0 | 0 | Transient | 157.71 | 4 | Check-Out | 2017-09-07 |
| 119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | A | 0 | No Deposit | 89.0 | 0 | Transient | 104.40 | 0 | Check-Out | 2017-09-07 |
| 119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | ... | A | 0 | No Deposit | 9.0 | 0 | Transient | 151.20 | 2 | Check-Out | 2017-09-07 |
86926 rows × 24 columns
# classifying data into two hotels
dfc_resort=dfc[dfc['hotel']=='Resort Hotel']
dfc_City=dfc[dfc['hotel']=='City Hotel']
def get_guests_for_eachCountry_for_hotel(hotel_df):
"""
this function take dataframe that filtered by hotel name and
return number of guests from eachCountry came to this hotel
"""
hotel_country=hotel_df['country'].value_counts().reset_index()
hotel_country.columns=['country','no_guests']
return hotel_country
# countries that came to resort hotel
resort_country=get_guests_for_eachCountry_for_hotel(dfc_resort)
resort_country
| country | no_guests | |
|---|---|---|
| 0 | PRT | 14026 |
| 1 | GBR | 5876 |
| 2 | ESP | 3442 |
| 3 | IRL | 1947 |
| 4 | FRA | 1414 |
| ... | ... | ... |
| 120 | BIH | 1 |
| 121 | MUS | 1 |
| 122 | COM | 1 |
| 123 | UGA | 1 |
| 124 | DJI | 1 |
125 rows × 2 columns
#resort_country=dfc_resort['country'].value_counts().reset_index()
#resort_country.columns=['country','no_guests']
#resort_country
def more_than_x_guests_per_country(hotel_df,x):
"""
this function take dataframe that filtered by hotel name and number of guests
that came from each country the benefits from this function it make visuals in map chart more
informative beacause we will not prefer to see 2 o 3 guests came from a country and this guests
affect the map we focus on alot of numbers of guests
"""
hotel_country=get_guests_for_eachCountry_for_hotel(hotel_df).query(f'no_guests >{x}')
return hotel_country
def geospatial_map(hotel_df,x,hotel_name):
"""
this function take dataframe that filtered by hotel name and hotel name
and plotting geospatial_map
"""
fig=px.choropleth(
more_than_x_guests_per_country(hotel_df,x),
locations=more_than_x_guests_per_country(hotel_df,x).country,
color=more_than_x_guests_per_country(hotel_df,x).no_guests,
hover_name=more_than_x_guests_per_country(hotel_df,x).country,
title=f'{hotel_name} country guest'
)
return fig.show()
geospatial_map(dfc_resort,1000,'resort hotel')
geospatial_map(dfc_City,1000,'city hotel')
#dfc_resort.columns
#getting average price of night per month
resort_prices=dfc_resort.groupby('arrival_date_month')['adr'].mean().reset_index()
resort_prices.columns=['month','AVG_price']
#getting average price of night per month
city_prices=dfc_City.groupby('arrival_date_month')['adr'].mean().reset_index()
city_prices.columns=['month','AVG_price']
# merging the two hotels by common column (month)
hotels=resort_prices.merge(city_prices,on='month')
# !pip install sort-dataframeby-monthorweek#IMPORTAAAAAAAAAAAAAAAANT
# !pip install sorted-months-weekdays
import sort_dataframeby_monthorweek as sd
# sorting by month "GREAT library for sorting by month"
hotels=sd.Sort_Dataframeby_Month(hotels,'month')
#renaming columns
hotels.columns=['month','AVG_price_resort_hotel','AVG_price_city_hotel']
hotels
| month | AVG_price_resort_hotel | AVG_price_city_hotel | |
|---|---|---|---|
| 0 | January | 49.282136 | 85.209038 |
| 1 | February | 54.244601 | 89.003009 |
| 2 | March | 57.670955 | 94.873190 |
| 3 | April | 79.512386 | 117.266459 |
| 4 | May | 80.746890 | 127.987557 |
| 5 | June | 112.424711 | 123.899318 |
| 6 | July | 156.329783 | 120.155187 |
| 7 | August | 187.658993 | 124.901541 |
| 8 | September | 100.993134 | 118.692065 |
| 9 | October | 63.830297 | 107.439936 |
| 10 | November | 48.986490 | 89.480841 |
| 11 | December | 65.379766 | 92.760386 |
px.line(hotels,x='month',y=['AVG_price_resort_hotel','AVG_price_city_hotel'],title='AVG Room Price per Night ')
value=dfc['meal'].value_counts()
name=dfc['meal'].value_counts().index
px.pie(dfc,values=value,names=name,hole=0.5)
busy_month_resort=dfc_resort.arrival_date_month.value_counts().reset_index()
busy_month_city=dfc_City.arrival_date_month.value_counts().reset_index()
busy_month_resort.columns=['month','visitors']
busy_month_city.columns=['month','visitors']
busy_month=busy_month_resort.merge(busy_month_city,on='month')
busy_month.rename(columns={'visitors_x':"visitors_resort",'visitors_y':'visitors_city'},inplace=True)
busy_month.head()
| month | visitors_resort | visitors_city | |
|---|---|---|---|
| 0 | August | 4648 | 6580 |
| 1 | July | 4289 | 5730 |
| 2 | May | 2916 | 5396 |
| 3 | April | 2795 | 5064 |
| 4 | June | 2749 | 4997 |
x=busy_month.month
px.bar(busy_month,x,y=['visitors_city','visitors_resort'],title='Count of Visitor per Month')
#sns.lineplot(busy_month,x,y=['visitors_resort','visitors_city'],title='Count of Visitor per Month')
dfc.market_segment.value_counts()
Online TA 51505 Offline TA/TO 13844 Direct 11723 Groups 4894 Corporate 4050 Complementary 682 Aviation 226 Undefined 2 Name: market_segment, dtype: int64
fig=px.bar(df, x=dfc.market_segment.value_counts(), y=dfc.market_segment.value_counts().index, orientation='h')
#for reversing the bars from down to up
fig.update_layout(yaxis={'categoryorder':'total ascending'})
#!jt -l
Available Themes: chesterish grade3 gruvboxd gruvboxl monokai oceans16 onedork solarizedd solarizedl
!jt -t grade3